Before you can run the code in this notebook you should follow the directions from the README.
In this tutorial we will explore methods for exploration and visualization of large complex datasets using R and Spark. We will cover the following topics:
This tutorial is mainly about visualization, ranging from summaries to more detailed views of the data. A major component of creating visualizations – large or small – from big data is that there is a lot of data manipulation involved. Consequently, a good deal of the tutorial is spent illustrating how to perform a wide variety of operations on data to get it into shape for visualization.
The divide and recombine or D&R method provides a highly scalable approach to analysis of large complex datasets. With D&R we work with meaningful, persistent divisions of the data. “Big data” is typically big because it is made up of collections of many subsets, sensors, locations, time periods, etc. A schematic view of the D&R process is shown in the figure below.
There are many possible ways to divide data. The best choice depends on the nature of the data and the analysis to be performed. Some possibilities include:
Once the data are divided, analytic or visual methods are applied independently to each subset in an embarrassingly parallel fashion. The results of these analyses are recombined to yield a statistically valid D&R result or visualization. We refer to these options as:
In this lesson, our focus is on summary and graphical recombination for the exploration of large complex datasets.
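As a minimal sketch of the divide, apply, and recombine steps described above, the following base R code works on a small made-up table of sensor readings. The data and column names are illustrative assumptions, not part of the tutorial's dataset, and the code only shows the shape of the computation; at scale the same pattern runs on a back end rather than in local memory.

```r
# Toy data: hypothetical sensor readings (illustrative values only)
readings <- data.frame(
  sensor = c("a", "a", "b", "b", "b", "c"),
  value  = c(1, 3, 2, 4, 6, 10)
)

# Divide: one subset per sensor
subsets <- split(readings, readings$sensor)

# Apply: the same summary is computed independently on each subset,
# so the subsets could be processed in parallel
per_subset <- lapply(subsets, function(d) {
  data.frame(sensor = d$sensor[1], mean_value = mean(d$value), n = nrow(d))
})

# Recombine: stack the per-subset summaries into one small table
recombined <- do.call(rbind, per_subset)
recombined
```

Only `recombined`, which has one row per subset, would need to be brought back into the local session.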
This tutorial focuses on the exploration and visualization of large complex datasets using the D&R paradigm. To do so, we need a massively scalable back end to perform the large-scale data operations. In this case we are using a Spark back end. The architecture of our environment is shown schematically in the figure below.
The components of the architecture are:
It is useful to discuss some of the limitations of this architecture with respect to the D&R paradigm and to compare it with other available D&R software. This helps clarify which architecture is appropriate for different situations and what we envision as the future of an ideal D&R architecture.
The D&R project originated as an R front-end to Hadoop, called RHIPE, the R and Hadoop Integrated Programming Environment. This R package allows you to write MapReduce code entirely in R and run it against datasets on Hadoop. MapReduce is not always the most straightforward way to think about processing data, so a companion package, “datadr” was created as a front end for specifying D&R tasks that are translated into MapReduce code. A tutorial on this was given last year at Strata Hadoop World and more about these packages can be found at deltarho.org.
group_by verb and must be specified for every operation, and your data is always tabular throughout the process.
While there are some critical big data needs that datadr/RHIPE/Hadoop addresses, given the momentum of both Spark and the Tidyverse, which includes dplyr, and the emergence of using list-columns in data frames to handle arbitrary data, we envision a future D&R environment based on these technologies that can give us arbitrary R execution and arbitrary data structures at scale.
In this tutorial, we use sparklyr and are limited to using it to summarize a larger dataset for the purpose of creating visualizations, which gets us pretty far.
Let’s now move on to some hands-on examples.
It’s time to start a Spark cluster and create a connection with sparklyr. In this case, you will start Spark on your local machine. Spark should already be installed on your system if you followed the installation instructions. For large-scale applications, Spark is run on a remote cluster.
The connection object, called sc in this case, manages the connection between your local R session and Spark. You will use references to the Spark connection whenever you send data and commands to Spark or receive results back.
library(tidyverse)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages --------------------------------------------------------------------------------------
filter(): dplyr, stats
lag(): dplyr, stats
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
library(sparklyr)
Attaching package: ‘sparklyr’
The following object is masked from ‘package:purrr’:
invoke
library(trelliscopejs)
library(forcats)
airlines <- readr::read_csv(file.path('data', 'airlines.csv'))
Parsed with column specification:
cols(
carrier = col_character(),
name = col_character()
)
sc <- spark_connect(master = "local")
Now that you have a Spark instance running, you can load the data from the .csv file in your local directory into Spark. If you are working with large scale data, you will need to use the more scalable data loading capabilities of Spark and will not load the data from a .csv file.
You do not load your large dataset into your local R session. The point of the D&R paradigm is to use a massively scalable back end for the heavy lifting. Only the recombined results are collected into the local R session. In this case, we are using Spark for our back-end. Other choices, such as Hadoop, would be suitable as well.
Notice that the first argument of the command below is sc, a reference to the Spark connection you have started. The name assigned, flights_tbl, is a reference you will use in R to access the data in Spark. Execute this code to load the data into your Spark session.
flights_tbl <- spark_read_csv(sc, "flights_csv", "data/flights2016.csv.gz")
This may take a few minutes to run. As noted, flights_tbl is a reference to your data in Spark, but we can treat it in many ways like a data frame in R.
To check that the data was read properly, we can print the object. This pulls a subset of the data into our local R session for viewing.
flights_tbl
This gives us a feel for what variables are in the data and how many records there are. Notice also that we have about 5.6 million rows of data.
Now that the data has been loaded into Spark we can start our first divide and recombine (D&R) example. The steps of this D&R example are:
- Divide the data by carrier with the group_by operation. In this case, there are 12 groups.
- Compute summary statistics for each group with the summarize verb. These calculations are independent of each other in all respects. They can be done in parallel even on different nodes of a cluster. Any other summary statistics can be computed in parallel as well.
- Sort the results with the arrange verb.
Ideally we would have liked to compute quartiles and the median, but sparklyr doesn’t support these calculations as part of a dplyr group_by() operation.
The code below applies a chain of dplyr verbs to the flights_tbl data frame. These operations are performed in Spark and the results are transferred to your local R session using the collect verb. Execute this code and examine the result.
cr_arr_delay <- flights_tbl %>%
  group_by(carrier) %>%
  summarise(
    mean_delay = mean(arr_delay),
    n = n()) %>%
  arrange(mean_delay) %>%
  collect()
cr_arr_delay # Print the results
The D&R process has reduced 5.6 million rows of raw data to 12 rows of summary statistics.
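To make the group, summarise, and arrange steps concrete, here is a small sketch that runs the same chain of verbs on a local data frame instead of a Spark table. The carrier codes and delay values are made up for illustration. One assumption worth flagging: in local R, `mean()` returns `NA` if any value is missing, so the sketch passes `na.rm = TRUE`; our understanding is that the SQL average used on the Spark side skips missing values on its own.

```r
library(dplyr)

# Hypothetical flights with one missing arrival delay (illustrative values only)
flights_local <- tibble(
  carrier   = c("AA", "AA", "DL", "DL", "DL", "UA"),
  arr_delay = c(10, 20, -5, 5, NA, 30)
)

cr_local <- flights_local %>%
  group_by(carrier) %>%                            # divide by carrier
  summarise(
    mean_delay = mean(arr_delay, na.rm = TRUE),    # per-group mean, ignoring NA
    n = n()                                        # row count, NA rows included
  ) %>%
  arrange(mean_delay)                              # sort the recombined result

cr_local
```

Note that `n()` counts every row in the group, so the count for a carrier can be larger than the number of delays that went into its mean.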
For this example, we used the dplyr package with sparklyr. The R dplyr package, combined with sparklyr, is used to script complex data munging and analysis operations in Spark.
- Operations are expressed as dplyr verbs.
- Verbs are chained together with the %>% operator.
- Results are transferred to the local R session with the collect verb.
Ultimately in this tutorial we want to demonstrate some powerful visualization tools for interactively exploring large datasets in detail, but with a new dataset, it is often best to first look at some high-level summary visualizations that help guide us toward behaviors we might want to inspect in more detail.
Given the summary statistics output from our operation above, calculating the mean delay by carrier, we will create some plots to further explore the relationships in these results.
As a first step, we need to join some human readable names to the summary statistics data frame.
# merge the airline info so we know who the carriers are
cr_arr_delay <- left_join(cr_arr_delay, airlines)
Joining, by = "carrier"
cr_arr_delay
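The join above relies on dplyr guessing the key, which is why it prints a "Joining, by" message. As a small sketch of the same idea with the key stated explicitly, the code below joins a made-up summary table to a made-up lookup table. The carrier code `ZZ` is hypothetical and is included only to show what a left join does when a key has no match.

```r
library(dplyr)

# Hypothetical summary and lookup tables (illustrative values only)
delays <- tibble(
  carrier    = c("DL", "AA", "ZZ"),
  mean_delay = c(0, 15, 7)
)
lookup <- tibble(
  carrier = c("AA", "DL"),
  name    = c("American Airlines Inc.", "Delta Air Lines Inc.")
)

# Naming the key with `by` avoids the message and guards against
# accidental joins on other columns that happen to share a name
joined <- left_join(delays, lookup, by = "carrier")
joined
```

Every row of `delays` is kept; the unmatched `ZZ` row gets `NA` in the `name` column, which is an easy way to spot carriers missing from the lookup table.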
Now that the dataset is prepared, let’s make some simple plots using the ggplot2 package. The code in the cell below uses ggplot to explore the mean delay by airline name.
ggplot(cr_arr_delay, aes(fct_reorder(name, mean_delay), mean_delay)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab(NULL) +
  ylab("Mean Arrival Delay (minutes)")
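The plot above orders the airlines along the horizontal axis with `fct_reorder()` from forcats. As a brief sketch of what that call does, the example below reorders three made-up carrier names by a made-up delay value; the names and numbers are illustrative only.

```r
library(forcats)

# Hypothetical carrier names and mean delays (illustrative values only)
name       <- c("Carrier B", "Carrier A", "Carrier C")
mean_delay <- c(12, 3, 7)

# The factor levels are sorted by mean_delay (ascending) instead of alphabetically
ordered_name <- fct_reorder(name, mean_delay)
levels(ordered_name)
```

ggplot2 lays out a discrete axis in level order, so reordering the levels is what sorts the points from smallest to largest mean delay.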
Note: In this tutorial we assume you have some exposure to the ggplot2 package.
- The ggplot function defines a data frame to operate on.
- The aes function defines the columns to use for the various dimensions of the plot, e.g. x, y, color, shape.
- Plot layers and options are chained together with +.
Your Turn: We have looked at the mean delay of flights by airline. But how does the mean distance of the flight change by airline? Is there a relationship between a carrier’s mean distance and mean delay? In the space below create and execute code to do the following:
- Use sparklyr to compute a new cr_arr_delay data frame, including all the same columns as before but also a new mean_distance column.
- Join the airline names to the cr_arr_delay data frame.
- Use ggplot2 to plot the mean distance by airline.
cat('Your code goes here')
Your code goes here
As discussed before, it is important to investigate multiple views of a dataset. Two further questions are: what is the relationship between the number of flights and the mean delay, and between the mean delay and the mean distance of the flights? The code in the cells below displays these plots.
ggplot(cr_arr_delay, aes(mean_delay, n)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab('Mean delay in minutes') +
  ylab("Number of flights by airline")
# ggplot(cr_arr_delay, aes(mean_delay, mean_distance)) +
#   geom_point() +
#   theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
#   xlab('Mean delay in minutes') +
#   ylab("Mean distance in miles")
In the previous example we worked with a fairly simple set of summary statistics: the mean delay, number of flights, and mean distance of flights, all grouped by a single factor, airline. These relationships give us some interesting insight into these data, but surely we can learn more about this dataset.
Let’s try another D&R example. In this case we will divide the data both by airline and month. The basic D&R pipeline is similar to the one we used before, but the results are more granular. The code in the cell below performs the following divide and recombine operations:
cr_mn_arr_delay <- flights_tbl %>%
  group_by(carrier, month) %>%
  summarise(
    mean_delay = mean(arr_delay),
    mean_distance = mean(distance),
    n = n()) %>%
  collect() %>%
  left_join(airlines) %>%
  mutate(month = factor(month))
Joining, by = "carrier"
cr_mn_arr_delay
We now have 12 months of summaries for each of the 12 carriers. Given that we have more values and more variables, there are many ways we might visualize these summaries. In this case we will use a powerful method known variously as a facet plot, conditioned plot, trellis plot, or the method of small multiples.
A faceted or conditioned plot is composed of a set of sub-plots defined by one or more conditioning variables. The data for each sub-plot is the result of a partitioning based on the values of the conditioning variable. This conditioning operation is, in effect, a group-by operation. This approach allows small multiples of a large complex dataset to be viewed in a systematic and understandable manner.
In effect, the method extends the number of dimensions projected onto a 2D computer display. This property makes conditioned plotting an ideal tool for complex datasets with either many variables or many records.
The idea of a facet plot has a long history. An early example of small multiples was used to display some results from the 1870 US census. The plot below combines small multiples with a treemap plot to show proportions of the population in different occupations or attending school.
The small multiples idea was popularized in Edward Tufte’s 1983 book. Bill Cleveland and colleagues at AT&T Bell Labs created the Trellis plotting software package using the S language. Cleveland called this method Trellis Display.
The ggplot2 package contains the facet_grid function, which defines the grid on which the sub-plots are created. The facet_grid function uses an R formula object to specify the conditioning variables that define the rows and columns of the grid. The general form of this formula is:
\[RowVariables \sim ColumnVariables\]
A conditioned plot with a single column, but multiple rows, is therefore defined:
\[RowVariables \sim\ .\]
Or, a conditioned plot with a single row, but multiple columns, is defined:
\[.\ \sim ColumnVariables\]
You can use multiple variables to condition rows and columns, using the + symbol as the operator:
\[RowVar1 + RowVar2 + \ldots \sim ColVar1 + ColVar2 + \ldots\]
Like all good things in visualization, there are practical limits. Creating a large grid of sub-plots using multiple conditioning variables quickly becomes confusing to look at and understand. Best practice is to use one or two conditioning variables to start with and then to explore the dataset by changing one conditioning variable at a time.
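The formula forms described above can be tried out on any data frame. The sketch below uses R's built-in `mtcars` dataset, which is an assumption made purely for illustration and has nothing to do with the flights data; it builds one plot for each of the three formula shapes without drawing them.

```r
library(ggplot2)

# A base scatterplot on the built-in mtcars data (illustrative only)
base_plot <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Rows only: one row of panels per value of cyl
p_rows <- base_plot + facet_grid(cyl ~ .)

# Columns only: one column of panels per value of gear
p_cols <- base_plot + facet_grid(. ~ gear)

# Two row variables combined with +, one column variable
p_both <- base_plot + facet_grid(cyl + am ~ gear)
```

Printing any of these objects, for example `p_both`, draws the grid. The last one already produces a fairly dense display, which illustrates why starting with one or two conditioning variables is usually the better choice.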
The code in the cell below creates a faceted plot of mean flight delay by month. The data in these plots are conditioned on the name of the airline, and the panels are ordered by mean flight delay.
ggplot(cr_mn_arr_delay, aes(month, mean_delay, group = 1)) +
  geom_point() +
  geom_line() +
  facet_grid(~ fct_reorder(name, mean_delay))
There is one plot for each airline, with the mean delay shown by month. These plots have been sorted by the mean delay by airline, so we can focus on the airlines with the greatest average delays. Notice that there are significant changes in the mean delays by month for each airline. Also notice that some airlines have very large jumps in mean delay in the summer months while it is not as pronounced for other airlines.
Your Turn: Next, let’s look at the relationship between the airlines and the number of flights.
- scale_y_log10() will create a plot with a log scale on the vertical axis.
The faceting examples above were useful in allowing us to examine average delays vs. month by carrier while allowing us to make visual comparisons across carriers. Often it is useful, instead of faceting, to overlay the data for the different groups in a single plot to make more relative comparisons between the groups.
Overlaying data from 12 airlines and trying to visually distinguish between all of them is difficult, and this is one of the reasons faceting is such a good idea: it helps deal with overplotting.
We can trade the complete view of a faceted plot for a clearer picture in a single plot by filtering out some of the data. Looking at the number of flights for each airline, there is a pretty clear separation between the bottom 6 and the top 6. Since the top 6 airlines account for 85% of all flights, they are probably the most interesting airlines to look at, so we will filter our data to compare the top 6 airlines in a single plot. The code in the cell below does the following:
top6 <- cr_arr_delay %>% filter(n > 400000) %>% .[["carrier"]]
top6
[1] "DL" "UA" "WN" "OO" "AA" "EV"
# overlay them all
cr_mn_arr_delay %>%
  filter(carrier %in% top6) %>%
  ggplot(aes(month, mean_delay, color = name, group = name)) +
  geom_point() +
  geom_line()
There appears to be a seasonal pattern to the mean delays for all of the top 6 carriers, which is similar for each airline. Of course, more years of data would help us more strongly support this conclusion. We see that in 2016, Delta typically had the best average on-time performance, especially in the fall and early winter.
Your Turn: Let’s make the same plot for all airlines.
cat('Your code goes here')
Your code goes here
We have seen an overall seasonal pattern for the top 6 airlines. Now, we are curious whether there is more to this very high-level summary.
Questions:
- Are different flight routes more prone to delays?
- Does variability across airlines change for different routes?
We can visually investigate these questions by creating the same plot as above (mean delay vs. month with each carrier overlaid) for every route. As we will see, there are many routes, too many to look at all at once in a simple ggplot2 faceted plot. For this, we will turn to Trelliscope, which allows us to create large faceted displays and interactively navigate through the panels as we learn what is happening in each subset of the data.
We can get the data into shape for this task by grouping by route (origin and dest), month, and carrier. Since we are looking at the mean delay, and some routes are traveled more rarely, we want to make sure we have enough data to compute a meaningful statistic. Because of this, we’ll only look at combinations of route, carrier, and month that have at least 25 flights.
We need to create a new grouping of the large dataset using sparklyr. The dplyr code in the cell below defines a sparklyr pipeline performing the following operations:
# group by origin, dest, carrier, month and get mean delay and # obs
# and pull this back into R
route_summ <- flights_tbl %>%
  group_by(origin, dest, carrier, month) %>%
  summarise(
    mean_delay = mean(arr_delay),
    n = n()) %>%
  filter(n >= 25 & carrier %in% top6) %>%
  collect()
route_summ
We have gone from over 5.6 million rows to about 50k rows, a reduction of two orders of magnitude, which is plenty small to work with in our local R session.
With a little more work, we can get this data more suitable for visualization.
route_summ2 <- route_summ %>%
  left_join(airlines) %>%
  rename(carrier_name = name) %>%
  mutate(
    carrier_name = factor(carrier_name),
    month = factor(month))
Joining, by = "carrier"
route_summ2
We have one more task to get the data into the state we need for visualization. For each route, we only want to plot data for airlines that recorded flights in all 12 months for that route. We can get a listing of all “complete” carrier/route combinations with the following:
compl_routes <- route_summ2 %>%
  group_by(origin, dest, carrier) %>%
  summarise(n = n()) %>%
  filter(n == 12) %>%
  select(-n)
compl_routes
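As a small-scale illustration of keeping only "complete" carrier/route combinations, the sketch below applies the same idea to a toy table with three months instead of twelve. The airport codes, carrier codes, and the `n_months` variable are hypothetical and exist only for this example.

```r
library(dplyr)

# Hypothetical route summaries: carrier X1 has all 3 months, X2 is missing month 2
toy <- tibble(
  origin     = "AAA",
  dest       = "BBB",
  carrier    = c("X1", "X1", "X1", "X2", "X2"),
  month      = c(1, 2, 3, 1, 3),
  mean_delay = c(5, 7, 6, 2, 4)
)

n_months <- 3  # the tutorial's data uses 12

# Count rows per route/carrier and keep those with every month present
complete <- toy %>%
  group_by(origin, dest, carrier) %>%
  summarise(n = n(), .groups = "drop") %>%
  filter(n == n_months) %>%
  select(-n)

# Keep only the summaries that belong to a complete combination
kept <- right_join(toy, complete, by = c("origin", "dest", "carrier"))
kept
```

Because `complete` has one row per key, `semi_join(toy, complete, by = c("origin", "dest", "carrier"))` would return the same rows here; it is mentioned as an alternative, not as what the tutorial uses.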
There are over 3000 route / carrier combinations with a summary value for all 12 months. We can reduce our route summary data to just these combinations by joining compl_routes with route_summ2.
- Use right_join() so that only route / carrier combinations in compl_routes are preserved.
route_summ3 <- right_join(route_summ2, compl_routes)
Joining, by = c("origin", "dest", "carrier")
route_summ3
There are now about 38k summaries to visualize.
Your Turn: Remember that we want to visualize the mean delay vs. month for each carrier, faceted by route (origin and destination). Can you write some dplyr code to determine how many routes there are in our data?
cat('Your code goes here')
Your code goes here
There are about 2,669 routes for us to visualize. That’s a lot of plots! But we will see how we can easily handle this with Trelliscope.
First, let’s make sure the plot function we were using before works on one route.
filter(route_summ3, origin == "LAX" & dest == "JFK") %>%
  ggplot(aes(month, mean_delay, color = carrier_name, group = carrier_name)) +
  geom_point() +
  geom_line() +
  ylim(c(-39, 75.5)) +
  scale_color_discrete(drop = FALSE)
Since we will be making a lot of these plots, let’s also add in a reference line of the overall monthly mean.
mn_arr_delay <- flights_tbl %>%
  group_by(month) %>%
  summarise(mean_delay = mean(arr_delay)) %>%
  collect() %>%
  mutate(month = factor(month)) %>%
  arrange(month)
mn_arr_delay
Now let’s add this to our plot.
filter(route_summ3, origin == "LAX" & dest == "JFK") %>%
  ggplot(aes(month, mean_delay, color = carrier_name, group = carrier_name)) +
  geom_line(aes(month, mean_delay), data = mn_arr_delay, color = "gray", size = 1, group = 1) +
  geom_point() +
  geom_line() +
  ylim(c(-39, 75.5)) +
  scale_color_discrete(drop = FALSE)
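The reference line works because a ggplot2 layer can take its own data frame. As a minimal sketch of that layering, the code below overlays a gray reference series on two made-up carrier series. All values and carrier names are hypothetical, and `inherit.aes = FALSE` is a choice made for this sketch so that the reference layer does not look for a carrier column it does not have.

```r
library(ggplot2)

# Hypothetical per-carrier monthly means (illustrative values only)
by_carrier <- data.frame(
  month        = factor(rep(1:3, times = 2)),
  mean_delay   = c(4, 9, 6, 1, 3, 2),
  carrier_name = rep(c("Carrier A", "Carrier B"), each = 3)
)

# Hypothetical overall monthly means used as the reference series
overall <- data.frame(
  month      = factor(1:3),
  mean_delay = c(3, 6, 4)
)

p <- ggplot(by_carrier,
            aes(month, mean_delay, color = carrier_name, group = carrier_name)) +
  # Reference layer is added first so it is drawn underneath the carrier lines;
  # group = 1 connects the points across the discrete month axis
  geom_line(aes(month, mean_delay, group = 1), data = overall,
            color = "gray", inherit.aes = FALSE) +
  geom_point() +
  geom_line()
```

Printing `p` draws the plot. Fixing the axis limits and the color scale, as the tutorial's code does with `ylim()` and `scale_color_discrete(drop = FALSE)`, keeps axes and carrier colors comparable when the same plot is repeated across many routes.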
filter(route_summ3, origin == "ATL") %>%
  ggplot(aes(month, mean_delay, color = carrier_name, group = carrier_name)) +
  geom_line(aes(month, mean_delay), data = mn_arr_delay, color = "gray", size = 1, group = 1) +
  geom_point() +
  geom_line() +
  ylim(c(-31, 47)) +
  scale_color_discrete(drop = FALSE) +
  facet_trelliscope(~ origin + dest, nrow = 2, ncol = 4, path = "route_delay_atl")
** note: When inside an R Markdown document, the only way to embed a Trelliscope display within the notebook is to use self_contained = TRUE.
writing panels [=========================================================------] 91% 132/145 eta: 4s
writing panels [==========================================================-----] 92% 133/145 eta: 3s
writing panels [==========================================================-----] 92% 134/145 eta: 3s
writing panels [===========================================================----] 93% 135/145 eta: 3s
writing panels [===========================================================----] 94% 136/145 eta: 3s
writing panels [============================================================---] 94% 137/145 eta: 2s
writing panels [============================================================---] 95% 138/145 eta: 2s
writing panels [============================================================---] 96% 139/145 eta: 2s
writing panels [=============================================================--] 97% 140/145 eta: 1s
building display obj [=============================================================--] 97% 141/145 eta: 1s
writing cognostics [==============================================================-] 98% 142/145 eta: 1s
writing thumbnail [==============================================================-] 99% 143/145 eta: 1s
writing display list [===============================================================] 99% 144/145 eta: 0s
writing app config [===============================================================] 100% 145/145 eta: 0s
The gray line gives us a nice reference point for how the route we are looking at compares to the overall mean monthly delay.
Now, after all this munging, we are finally ready to create a Trelliscope display. Fortunately, creating a Trelliscope display is straightforward: all we need to do is add a faceting directive to our ggplot code, using the function facet_trelliscope() instead of ggplot2's usual faceting functions.
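As a minimal sketch of the idea, the snippet below facets ggplot2's built-in mpg dataset rather than the flight summaries. The dataset, the faceting variable, and the output directory are assumptions chosen for illustration and are not part of this tutorial's data.

```r
library(ggplot2)
library(trelliscopejs)

# Toy data: ggplot2's built-in mpg dataset stands in for the flight summaries.
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  # One panel per manufacturer, shown 2 rows by 4 columns per page.
  # The output directory is an assumption made for this sketch.
  facet_trelliscope(~ manufacturer, nrow = 2, ncol = 4,
    path = file.path(tempdir(), "mpg_demo"))

# Printing the object writes the panels to disk and opens the viewer.
if (interactive()) print(p)
```

Apart from the last layer, this is ordinary ggplot2 code, which is why an existing faceted plot can usually be converted by swapping only the faceting call.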
nycflights13::airports
If the above code snippet doesn’t open up a web browser with the resulting plot, you can view it in your web browser with the following command:
nycflights13::airports
When the display opens, you should see something like this:
This is an interactive faceted display that opens in your web browser. There are about 150 routes out of Atlanta that fit our criteria. That is too many panels to display at once, so the display shows the first 8, ordered by default alphabetically by our grouping variables, the origin and destination airport codes.
Here are some simple interactions you can do with the Trelliscope display:
You will notice some unfamiliar variables in this list, such as mean_delay_mean. Trelliscope inspects the data that you pass in to your ggplot command and automatically computes per-subset metrics that might be useful for navigating the panels. One of the variables in our input data, route_summ3, is mean_delay. Trelliscope took this variable, averaged it over the observations in each route, and made the result available as a metric to sort and filter on, mean_delay_mean. We call these metrics cognostics. You can sort on the mean_delay_mean cognostic to see which routes and airlines are most consistently early or late.
- Click the “Filter” button on the sidebar to filter the panels to be displayed. A list of cognostic variables is shown and you can select one to filter on. For example, if you select mean_delay_mean, an interactive histogram showing the distribution of this variable appears and you can select panels that, for example, have a negative mean delay.
- Click the “Labels” button on the sidebar to control which labels are displayed underneath each panel.
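To make the idea of a cognostic concrete, here is a rough sketch of the kind of per-panel summary that mean_delay_mean represents. It is not Trelliscope's internal code, and toy_routes is a hypothetical miniature stand-in for route_summ3.

```r
library(dplyr)

# Hypothetical miniature version of route_summ3: one row per route and month.
toy_routes <- tibble(
  origin = "ATL",
  dest = c("BOS", "BOS", "BOS", "DEN", "DEN", "DEN"),
  month = c(1, 2, 3, 1, 2, 3),
  mean_delay = c(-4, -2, 0, 10, 12, 8)
)

# One value per panel: the mean of mean_delay within each route.
toy_cogs <- toy_routes %>%
  group_by(origin, dest) %>%
  summarise(mean_delay_mean = mean(mean_delay), .groups = "drop")

# Sorting and filtering panels then amount to ordinary operations on this table.
arrange(toy_cogs, mean_delay_mean)
filter(toy_cogs, mean_delay_mean < 0)
```

The viewer's "Sort" and "Filter" controls act on a table like toy_cogs, with one row per panel.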
Your Turn: Investigate the Trelliscope display and try to answer the following questions:
- Can you find some odd or unexpected behavior in this display?
- Do any of the routes seem to follow the seasonal pattern we saw in our aggregate plot?
- Did subsetting the data by route provide any additional interesting insights about the data?
- What other ways might you think of dividing or visualizing this data that might be interesting?
***
In our previous display, we didn’t have too many variables to filter or sort our panels by, so we may want to add more to our display.
One piece of data we can use to augment our display is metadata about the airports. The nycflights13 R package contains such a dataset.
dest_airports <- nycflights13::airports %>%
  rename(dest = faa, dest_name = name, dest_lat = lat, dest_lon = lon,
    dest_alt = alt, dest_tzone = tzone) %>%
  select(-c(tz, dst))
dest_airports
Since we are only looking at a single origin airport, let’s add airport metadata for our destinations. Let’s select a few variables from this dataset and rename them to be more meaningful when applied to destination airports.
route_summ4 <- left_join(route_summ3, dest_airports)
Joining, by = "dest"
route_summ4
The variable that matches this data to our route_summ3 data is dest, so we can simply join on that to get a new dataset:
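The sketch below illustrates the same kind of join on toy data, with the key named explicitly through by = "dest". The tables and the airport code "ZZZ" are hypothetical. They are included to show that a left join keeps routes whose destination has no metadata and fills the new columns with NA.

```r
library(dplyr)

# Hypothetical route summaries; "ZZZ" has no entry in the metadata table.
toy_summ <- tibble(dest = c("BOS", "DEN", "ZZZ"), mean_delay = c(-2, 10, 3))
toy_airports <- tibble(
  dest = c("BOS", "DEN"),
  dest_tzone = c("America/New_York", "America/Denver")
)

# Naming the key explicitly avoids relying on automatic column matching.
toy_joined <- left_join(toy_summ, toy_airports, by = "dest")

# Destinations that found no match in the metadata.
toy_unmatched <- anti_join(toy_summ, toy_airports, by = "dest")
```

Checking the unmatched rows after a join is a quick way to spot destinations that will show missing labels in the display.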
route_summ4 <- route_summ4 %>%
  group_by(origin, dest) %>%
  mutate(min_delay = min(mean_delay), max_delay = max(mean_delay))
route_summ4
Suppose we also want to be able to sort routes according to the smallest and largest mean delay observed on each route across all airlines and months. To do this, we can add new summary columns to our data:
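The following toy example, with hypothetical values, shows why a grouped mutate() is used here rather than summarise(): mutate() keeps every row and repeats the per-route value on each of them, so the monthly data needed for plotting is preserved alongside the new columns.

```r
library(dplyr)

# Hypothetical data: two routes with two observations each.
toy <- tibble(
  origin = "ATL",
  dest = c("BOS", "BOS", "DEN", "DEN"),
  mean_delay = c(-4, 6, 10, 12)
)

# Grouped mutate: same number of rows, per-route extremes repeated on each row.
toy_ext <- toy %>%
  group_by(origin, dest) %>%
  mutate(min_delay = min(mean_delay), max_delay = max(mean_delay)) %>%
  ungroup()

# Grouped summarise: collapses to one row per route, dropping the monthly detail.
toy_collapsed <- toy %>%
  group_by(origin, dest) %>%
  summarise(min_delay = min(mean_delay), max_delay = max(mean_delay),
    .groups = "drop")
```

Because min_delay and max_delay are constant within a route, they behave like one value per panel and can serve as sorting and filtering variables.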
filter(route_summ4, origin == "ATL") %>%
  ggplot(aes(month, mean_delay, color = carrier_name, group = carrier_name)) +
    # Gray reference line: overall mean monthly delay
    geom_line(aes(month, mean_delay), data = mn_arr_delay, color = "gray", size = 1, group = 1) +
    geom_point() +
    geom_line() +
    ylim(c(-31, 47)) +
    scale_color_discrete(drop = FALSE) +
    facet_trelliscope(~ origin + dest, nrow = 2, ncol = 4, path = "route_delay_atl2")
**Note:** When inside an R Markdown document, the only way to embed a Trelliscope display within the notebook is to use self_contained = TRUE.
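A minimal sketch of that option, again on ggplot2's built-in mpg dataset rather than the flight data (the dataset and faceting variable are assumptions for illustration):

```r
library(ggplot2)
library(trelliscopejs)

# self_contained = TRUE bundles the panels into the widget itself so that it
# can be embedded in the rendered notebook.
p_embed <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_trelliscope(~ class, nrow = 1, ncol = 3, self_contained = TRUE)

# In an R Markdown chunk, leaving p_embed as the last expression displays it.
if (interactive()) print(p_embed)
```

Self-contained displays embed every panel in the output, so they are best suited to displays with a modest number of panels.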
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
Now that we have more variables in our data, let’s recreate our Trelliscope display. The only thing we change here is the input dataset, and we place the resulting display in a different directory than our previous one.
Again, if this display didn’t open in your browser as a result of running the previous cell, run the following:
cat('Your code goes here')
Now we can see if these new cognostics provide us any more meaningful ways to navigate or understand our data.
Your Turn: Investigate the Trelliscope display and try to answer the following questions:
- Can you turn on the dest_name label to give you a helpful hint as to where or what the destination airport code is referring to?
- What is the most common destination time zone for flights out of Atlanta? Hint: turn on the dest_tzone filter and look at its distribution.
- What are the worst destination time zones in terms of more delayed flights? Hint: in addition to turning on the dest_tzone filter, turn on the mean_delay_mean filter and then examine how the distribution of this filter changes as you make different selections of the dest_tzone filter.
- Which route has the worst delay time? Hint: sort on max_delay.
***
It is worth noting that although we are only looking at ~150 plots in the above Trelliscope displays, the Trelliscope approach conceptually scales to displays with a very large number of panels. We have made displays with panels numbering in the millions. Having a million subsets of data available does not mean that you have to look at all of them. An interactive viewer like Trelliscope, with cognostics that guide you to interesting areas of your data, becomes a powerful, flexible, and detailed exploratory visualization tool.
For this example, a much more interesting display would be to plot all ~2700 routes in a single display. Currently, trelliscopejs pre-renders all the plots, and since ggplot2 is a bit slow, it would take 10-15 minutes to generate the display. Even with rendering on-the-fly, which was supported in the previous Shiny-powered Trelliscope package, and which will be supported in trelliscopejs soon, ggplot2 can be too slow for a good user experience. Other plotting packages can be used with trelliscopejs. Some packages render much more quickly. We provide pointers to these packages in the next section.
A final point about larger Trelliscope displays: the datadr package supports use of arbitrary data structures and R code on a Hadoop cluster. You can generate trelliscopejs displays directly against a very large dataset on the cluster. You can find a tutorial on these techniques from last year’s Strata. In the future, it is our vision for sparklyr to support this capability. You will then have an interactive window into large datasets with sparklyr.
As mentioned in the previous section, if you don’t want to use ggplot2 to generate panels, you are welcome to use other plotting libraries, including conceptually any htmlwidget, to generate the panels of your display. This goes beyond the scope and timing of this tutorial, but a good reference on this can be found here.
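For orientation only, here is a rough sketch of what htmlwidget-based panels might look like. It uses plotly, ggplot2's built-in mpg dataset, and a temporary output directory, all of which are assumptions made for this illustration rather than part of the tutorial.

```r
library(dplyr)
library(tidyr)
library(plotly)
library(trelliscopejs)

# Nest the data so that each row holds the subset for one panel, then build
# one plotly widget per subset with map_plot().
panels <- ggplot2::mpg %>%
  group_by(manufacturer) %>%
  nest() %>%
  mutate(panel = map_plot(data, function(d) {
    plot_ly(data = d, x = ~cty, y = ~hwy, type = "scatter", mode = "markers")
  }))

# Writing the display is guarded so the sketch can be sourced non-interactively.
# The display name and output path are assumptions.
if (interactive()) {
  trelliscope(panels, name = "city_vs_highway_mpg",
    path = file.path(tempdir(), "mpg_widgets"))
}
```

The design differs from the ggplot2 route in that the panels live in a list-column of an ordinary data frame, so any other columns with one value per row can act as cognostics.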
Your Turn: Create another Trelliscope display showing all routes that originate from your favorite airport, or that originate or have destinations at various airports of interest.
cat('Your code goes here')